Skip to content

[#38390] Bug fix for Datetime64Formatter with values of ndim > 1 #38391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

BryanCutler
Copy link
Contributor

@BryanCutler BryanCutler commented Dec 9, 2020

This change fixes a bug in Datetime64Formatter that errors with values of ndim > 1. This was discovered when using an ExtensionArray with 2D values of datetime64 and attempting to format/display. The formatter currently allows for 2D values, but will end up returning nested strings instead of a flat list of string. The same error occurs when using the Datetime64Formatter directly on 2D values.

The fix is to flatten the values to get a flat list of formatted datetime64 strings, then reshape, and then format a second time over the outer dimension to produce the properly nested list format and the final flat list of strings.

@@ -3106,6 +3106,43 @@ def format_func(x):
result = formatter.get_result()
assert result == ["10:10", "12:12"]

def test_datetime64formatter_2d_array(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need any tz-aware cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea. It's likely to fail the same way so I'll add that too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if not isinstance(values, DatetimeIndex):
values = DatetimeIndex(values)
if not isinstance(flat_values, DatetimeIndex):
flat_values = DatetimeIndex(flat_values)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeIndex appears to only support 1-d values. It will actually construct with dim > 1 but can error at some places, so pass flattened values just to be safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use DatetimeArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a little cleaner since DatetimeArray._format_native_types is the only thing being used here. It still might be good to call values.ravel() here since it's needed if self.formatter is set, although not really necessary otherwise.

flat_str_values = np.array([self.formatter(x) for x in flat_values])
fmt_values = flat_str_values.reshape(values.shape)
else:
fmt_values = flat_values._data._format_native_types(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will actually call ravel()/reshape() itself, but flattened values are needed above so easier to use that here as well.

@BryanCutler
Copy link
Contributor Author

Thanks for the review @jbrockmendel , sorry for the delay. I updated with a fix and tests for Datetime64TZFormatter as well, since the same bug happens.

@BryanCutler BryanCutler force-pushed the bug-ext-array-2d-datetime64-38390 branch from 75c381c to da0f037 Compare December 15, 2020 19:17
@BryanCutler
Copy link
Contributor Author

The remaining test failures look like they are unrelated from a network timeout

values = DatetimeIndex(values)
values = np.asarray(self.values)
flat_values = values.ravel() if values.ndim > 1 else values
flat_values = DatetimeArray(flat_values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeArray can handle 2D values, so some of the ravel/ndim checks might not be needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I noticed it's able to handle 2D values, but the extension array that I was fixing this for can have higher dimensions, so it would still need to check above that plus when self.formatter is set. It seems cleaner to flatten at the beginning and work with flattened values. And I think calling ravel() a second time would be a no-op?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK no problem

@@ -242,6 +242,7 @@ I/O
- Bug in :func:`read_csv` not accepting ``usecols`` with different length than ``names`` for ``engine="python"`` (:issue:`16469`)
- Bug in :func:`read_csv` raising ``TypeError`` when ``names`` and ``parse_dates`` is specified for ``engine="c"`` (:issue:`33699`)
- Allow custom error values for parse_dates argument of :func:`read_sql`, :func:`read_sql_query` and :func:`read_sql_table` (:issue:`35185`)
- Bug in :class:`Datetime64Formatter` that caused error on string representation with extension types of datetime64 values and ndim > 1 (:issue:`38390`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not user facing (as its internal), is there anything that is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug was found when calling repr() on an ExtensionType, I will put this in terms of that instead.

if not isinstance(values, DatetimeIndex):
values = DatetimeIndex(values)
values = np.asarray(self.values)
flat_values = values.ravel() if values.ndim > 1 else values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the @ravel_compat here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that would work. The only thing is that if self.formatter is set, then it would add an unnecessary conversion to an numpy.ndarray and then back again to list. I'm not sure how crucial that case is, would you like me to go ahead and change it anyway?


if self.formatter is not None and callable(self.formatter):
return [self.formatter(x) for x in values]
fmt_values = [self.formatter(x) for x in flat_values]
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we not using just the _format_native_types here? this makes this way more complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, which line are you referring to? Did you mean instead of calling self.formatter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that self.formatter may be user-supplied at some point in the process?

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string labels Dec 22, 2020
@jreback
Copy link
Contributor

jreback commented Dec 22, 2020

more importantly, shouldn't this actually be handled on the EA itself?

@BryanCutler
Copy link
Contributor Author

Thanks for reviewing @jreback , I just had a couple questions.

more importantly, shouldn't this actually be handled on the EA itself?

This would be preferred, but I could not find a way to do this. When calling repr() on a Series wrapping my ExtensionArray, the formatting goes like this:

  1. Series sees the dtype is an ExtensionType and ends up in ExtensionArrayFormatter._format_strings()
  2. ExtensionArrayFormatter._format_strings() converts array to numpy.ndarray
  3. Datetime64Formatter formats the ndarray into a list of strings and returns

Is there a way for the ExtensionArray to tie into this somehow?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 28, 2021
@BryanCutler
Copy link
Contributor Author

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

I am still planning on working this. I noticed there is a similar problem with extension arrays with floating point values of ndim > 1, so I will work on a fix that will hopefully handle both of these cases soon.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 7, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 7, 2021
@BryanCutler
Copy link
Contributor Author

I haven't had the time to get back to this, will close for now and reopen when I can update

@BryanCutler BryanCutler closed this Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: ExtensionArray with 2D datetime64 values errors on display formatting
4 participants